Kiefer Wolfowitz Algorithm is Asymptotically Optimal for a Class of Non-Stationary Bandit Problems

Authors

  • Rahul Singh
  • Taposh Banerjee
Abstract

We consider the problem of designing an allocation rule, or “online learning algorithm,” for a class of bandit problems in which the set of control actions available at each time t is a convex, compact subset of R. Upon choosing an action x at time t, the algorithm obtains a noisy value of the unknown and time-varying function f_t evaluated at x. The “regret” of an algorithm is the gap between its expected reward and the reward earned by a strategy that knows the function f_t at each time t and hence chooses the action x_t that maximizes f_t. For this non-stationary bandit set-up, we propose two variants of the Kiefer-Wolfowitz (KW) algorithm: i) KW with a fixed step-size β, and ii) KW with a sliding window of length L. We show that if the number of times the function f_t changes over the horizon T is o(T), and if the learning rates of the proposed algorithms are chosen “optimally,” then the regret of the proposed algorithms is o(T).

I. KIEFER WOLFOWITZ ALGORITHM

We begin by describing the KW algorithm for the case in which the function f to be optimized is fixed; its maximizer is denoted θ(f) ∈ D. The vanilla version of the KW algorithm maintains, at each time-step n, an estimate X_n of the maximizer. It then estimates the derivative of the unknown function f by sampling the function values at the points X_n + c_n and X_n − c_n. Let Y_n^+ and Y_n^- be the noisy values of the function at X_n + c_n and X_n − c_n, respectively, and denote by Y_n the resulting estimate of the derivative of f at X_n. We then have

    Y_n = (Y_n^+ − Y_n^-) / (2 c_n).
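
To make the recursion above concrete, here is a minimal Python sketch of the vanilla KW iteration on a scalar, compact action set, together with a toy usage example. It is not the authors' implementation: the names (kw_maximize, f_noisy) and the tuning sequences a(n) = 1/n, c(n) = n^(-1/4) are placeholder choices made here, and the fixed step-size variant mentioned in the abstract would correspond to replacing a(n) by the constant β.

import numpy as np

def kw_maximize(f_noisy, x0, n_steps=1000,
                a=lambda n: 1.0 / n,       # step-size sequence a_n (placeholder choice)
                c=lambda n: n ** -0.25,    # perturbation-width sequence c_n (placeholder choice)
                lower=0.0, upper=1.0):
    # Vanilla Kiefer-Wolfowitz iteration for maximizing a function that is only
    # observed through noisy evaluations f_noisy(x); the iterate is projected
    # back onto the compact action set D = [lower, upper].
    x = x0
    for n in range(1, n_steps + 1):
        cn = c(n)
        y_plus = f_noisy(x + cn)     # noisy value at X_n + c_n
        y_minus = f_noisy(x - cn)    # noisy value at X_n - c_n
        grad_est = (y_plus - y_minus) / (2.0 * cn)   # Y_n, the derivative estimate
        x = float(np.clip(x + a(n) * grad_est, lower, upper))  # ascent step, then projection
    return x

# Toy example: maximize f(x) = -(x - 0.3)^2 observed with Gaussian noise.
rng = np.random.default_rng(0)
x_hat = kw_maximize(lambda x: -(x - 0.3) ** 2 + 0.1 * rng.standard_normal(), x0=0.9)
print(x_hat)   # converges toward the maximizer 0.3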

Related papers

A companion for the Kiefer-Wolfowitz-Blum stochastic approximation algorithm

A stochastic algorithm for the recursive approximation of the location θ of a maximum of a regression function was introduced by Kiefer and Wolfowitz (1952) in the univariate framework, and by Blum (1954) in the multivariate case. The aim of this paper is to provide a companion algorithm to the Kiefer-Wolfowitz-Blum algorithm, which makes it possible to simultaneously and recursively approximate the size...

General Bounds and Finite-Time Improvement for the Kiefer-Wolfowitz Stochastic Approximation Algorithm

We consider the Kiefer-Wolfowitz (KW) stochastic approximation algorithm and derive general upper bounds on its mean-squared error. The bounds are established using an elementary induction argument and phrased directly in terms of the tuning sequences of the algorithm. From this we deduce the non-necessity of one of the main assumptions imposed on the tuning sequences by Kiefer and Wolfowitz [Kie...

Variational Calculus in Space of Measures and Optimal Design

The paper applies abstract optimisation principles in the space of measures within the context of optimal design problems. It is shown that within this framework it is possible to treat various design criteria and constraints in a unified manner, providing a “universal” variant of the Kiefer-Wolfowitz theorem and giving a full spectrum of optimality criteria for particular cases. The described s...

Asymptotically Optimal Multi-Armed Bandit Policies under a Cost Constraint

We develop asymptotically optimal policies for the multi-armed bandit (MAB) problem under a cost constraint. This model is applicable in situations where each sample (or activation) from a population (bandit) incurs a known bandit-dependent cost. Successive samples from each population are i.i.d. random variables with unknown distribution. The objective is to have a feasible policy for deciding ...

Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems

Multi-armed bandit tasks have been extensively used to model the problem of balancing exploitation and exploration. A most challenging variant of the MABP is the non-stationary bandit problem where the agent is faced with the increased complexity of detecting changes in its environment. In this paper we examine a non-stationary, discrete-time, finite horizon bandit problem with a finite number ...

Journal:
  • CoRR

Volume: abs/1702.08000  Issue: -

Pages: -

Publication year: 2017